A Parallel N-Body Data Mining Framework
Abstract
The N-body or multi-tree approach for accelerating data mining methods has produced some of the fastest known solutions for a significant class of fundamental methods. We present a standard mathematical model and an associated programming model that allow these problems to be scaled further via parallelization, without significant extra programmer effort. With the framework, we derive a strategy for multi-core and cluster parallelism and create an implementation, Tree-based High-Order Reduce (THOR). We demonstrate three realistic parallel applications: kernel density estimation on the Sloan Digital Sky Survey, two-point correlation of galactic halo positions, and nearest-neighbor search on high-dimensional speech data.

Many powerful nonparametric methods that take full advantage of massive data require analyzing all pairs of distances; however, such analysis is time-consuming. For a significant subclass of these problems, which we call generalized N-body problems (GNPs), a higher-order divide-and-conquer technique described in [6] can avoid nearly all computation by skipping combinations of regions with little mutual interaction. Higher-order divide-and-conquer on metric data typically uses space-partitioning trees, such as kd-trees [4], ball trees [19], and cover trees [1], to assist in dividing the data into regions. This use of trees is called the dual-tree approach or, more broadly, the multi-tree approach.

This approach generalizes long-used algorithms such as kd-tree-based nearest-neighbor search [5], as well as more recent algorithms that use space-partitioning trees for N-body physical simulation, such as the celebrated Fast Multipole Method (FMM) [8], and for statistical problems like kernel regression [20]. For nearest-neighbor search and kernel regression, the dual-tree approach generalizes these earlier algorithms by organizing all query points in a second tree and considering them simultaneously.
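As an illustration of the dual-tree idea described above, the following is a minimal sketch of pair counting within a fixed radius, the kind of computation that underlies two-point correlation. It is not THOR and not the authors' implementation: the names `Node`, `min_dist`, `max_dist`, and `dual_tree_count`, the median-split kd-tree, and the leaf size are all assumptions chosen for brevity. The sketch prunes a pair of regions when their bounding boxes show that no point pair can qualify, and counts the whole pair at once when every point pair must qualify.

```python
import math
import random


class Node:
    """Hypothetical kd-tree node: its points plus an axis-aligned bounding box."""

    def __init__(self, points, leaf_size=8):
        self.points = points
        dims = range(len(points[0]))
        self.lo = [min(p[d] for p in points) for d in dims]
        self.hi = [max(p[d] for p in points) for d in dims]
        self.left = self.right = None
        if len(points) > leaf_size:
            # Split at the median along the widest dimension.
            d = max(dims, key=lambda k: self.hi[k] - self.lo[k])
            pts = sorted(points, key=lambda p: p[d])
            mid = len(pts) // 2
            self.left = Node(pts[:mid], leaf_size)
            self.right = Node(pts[mid:], leaf_size)


def min_dist(a, b):
    """Lower bound on the distance between any point in box a and any in box b."""
    return math.sqrt(sum(max(0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]) ** 2
                         for d in range(len(a.lo))))


def max_dist(a, b):
    """Upper bound on the distance between any point in box a and any in box b."""
    return math.sqrt(sum(max(a.hi[d] - b.lo[d], b.hi[d] - a.lo[d]) ** 2
                         for d in range(len(a.lo))))


def dual_tree_count(q, r, radius):
    """Count ordered pairs (x in q, y in r) with dist(x, y) <= radius."""
    if min_dist(q, r) > radius:
        return 0  # Exclusion: the regions are too far apart to interact.
    if max_dist(q, r) <= radius:
        return len(q.points) * len(r.points)  # Inclusion: every pair qualifies.
    if q.left is None and r.left is None:
        # Both nodes are leaves: fall back to the exhaustive base case.
        return sum(1 for x in q.points for y in r.points
                   if math.dist(x, y) <= radius)
    # Otherwise recurse, splitting the larger node that still has children.
    if r.left is None or (q.left is not None and len(q.points) >= len(r.points)):
        return (dual_tree_count(q.left, r, radius)
                + dual_tree_count(q.right, r, radius))
    return (dual_tree_count(q, r.left, radius)
            + dual_tree_count(q, r.right, radius))


if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random()) for _ in range(300)]
    root = Node(data)
    # The same tree serves as query and reference set, so self-pairs are counted.
    print(dual_tree_count(root, root, 0.1))
```

Passing the same tree as both arguments counts ordered pairs, including each point paired with itself; a two-point correlation estimate would normally correct for that. The two early returns are where the savings come from: most node pairs are resolved from their bounding boxes alone, without touching individual points.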
The multi-tree approach generalizes the FMM by casting it recursively, allowing arbitrary tree structures, recursion patterns, approximation schemes, and error criteria. This approach has been applied to many statistical and physical problems on metric data, such as bounded-error kernel density estimation [6, 7, 15, 14], k-nearest-neighbor classification [17], kernel discriminant analysis [21], nearest-neighbor search [6], n-point correlation [6, 18], and more. For each of these problems, no overall faster serial algorithms are known. Other machine learning problems treated with this strategy include dimensionality reduction methods [9], nonparametric belief propagation [10], linear algebraic machine learning methods [2], and particle filters [13]. Many current and future data mining techniques remain to be considered in depth with this strategy, not to mention a large array of related fundamental problems in computational geometry, physics, and …